The focus of the lab is to perform two specific tasks in prediction based on the dataset:
The UK government has gathered traffic accident data from the years 2000 to 2016, capturing over 1.6 million accidents. The data is collected from police reports and does not include minor fender-bender incidents. To reduce complexity and resource requirements, we will work on the accidents that occurred during 2012.
This dataset originally comes from the UK government website but is hosted by BigML Inc. It can be downloaded from the link: https://bigml.com/user/czuriaga/gallery/dataset/525dcd15035d076e7f00e3ac.
# Import the libraries needed for the lab
import pandas as pd
import numpy as np
# Vizualisation
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.utils.multiclass import unique_labels
from sklearn import metrics as mt
import plotly.offline as py
import plotly.graph_objs as go
from plotly import tools
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
py.init_notebook_mode(connected=True)
# Load the UK Accidents 2012 dataset
df = pd.read_csv('../../data/UKAccidents.csv')
print('UK Accidents dataset dimensions: ', df.shape)
Points: 10
Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis.
Some of the records in the dataset appear to be duplicates, but closer inspection shows that they are not. When two or more vehicles are involved in an accident, there is one row per vehicle: the location and other accident-level features are repeated, while the driver and vehicle information differs.
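A minimal sketch of this structure, on a made-up frame (the column names here are illustrative, not the dataset's exact headers):

```python
import pandas as pd

# One row per vehicle: multi-vehicle accidents share an accident
# identifier and location, but differ in vehicle-level columns.
records = pd.DataFrame({
    'Accident_Index': ['A1', 'A1', 'A2', 'A3', 'A3', 'A3'],
    'Latitude':       [51.5, 51.5, 52.2, 53.4, 53.4, 53.4],
    'Vehicle Type':   ['Car', 'Van', 'Car', 'Car', 'Bus', 'Motorcycle'],
})

# Counting rows per accident shows these are not true duplicates.
vehicles_per_accident = records.groupby('Accident_Index').size()
print(vehicles_per_accident)
```

Accident A1 involves two vehicles and A3 involves three, yet no row is a full duplicate of another.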
Based on our understanding and manual analysis of the data, the following attributes will be removed, as they are not relevant to predicting accident severity.
We will need to perform one hot encoding for all the categorical variables.
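As a small sketch of what the encoding below does, consider a made-up frame with one categorical and one numeric column:

```python
import pandas as pd

toy = pd.DataFrame({
    'Road Type': ['Single carriageway', 'Roundabout', 'Single carriageway'],
    'Speed limit': [30, 40, 60],
})
toy['Road Type'] = toy['Road Type'].astype('category')

# get_dummies expands each categorical column into one 0/1 indicator
# column per level; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```

The two levels of 'Road Type' become two indicator columns while 'Speed limit' is kept as-is, which is exactly how the attribute count grows after encoding.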
# Remove unwanted attributes
df.drop(['1st Road Number', '2nd Road Number', 'Date', 'Date.year', 'Location Easting OSGR', 'Location Northing OSGR',
'Local Authority (District)', 'Local Authority (Highway)', '1st Road Class', '2nd Road Class',
'Pedestrian Crossing-Human Control', 'Pedestrian Crossing-Physical Facilities', 'Special Conditions at Site',
'Carriageway Hazards', 'Did Police Officer Attend', 'LSOA of Accident Location', 'Towing and Articulation',
'Vehicle Location-Restricted Lane', 'Police Force'], axis=1, inplace=True)
# Drop all the instances with NaN values
df.dropna(inplace=True)
# Keep the targets, then build the feature matrices as independent copies:
# assigning `X = df` would alias the same frame, so dropping a column for
# one task would silently remove it from the other. Both feature matrices
# exclude both target columns, matching the shapes reported later.
y_classification = df['Accident Severity']
y_regression = df['Engine Capacity (CC)']
X_classification = df.drop(['Accident Severity', 'Engine Capacity (CC)'], axis=1)
X_regression = X_classification.copy()
# List all the categorical attributes identified for one hot encoding
categorical_attributes = ['Road Type', 'Junction Detail', 'Junction Control', 'Light Conditions', 'Weather Conditions',
'Road Surface Conditions', 'Urban or Rural Area', 'Vehicle Type', 'Vehicle Manoeuvre',
'Junction Location', 'Skidding and Overturning', 'Hit Object in Carriageway',
'Vehicle Leaving Carriageway', 'Hit Object off Carriageway', '1st Point of Impact',
'Was Vehicle Left Hand Drive?', 'Journey Purpose of Driver', 'Sex of Driver', 'Age Band of Driver',
'Propulsion Code', 'Driver IMD Decile', 'Driver Home Area Type']
# Change the data type for the categorical attributes
for feature in list(df.columns):
    if feature in categorical_attributes:
        X_classification[feature] = X_classification[feature].astype('category')
        X_regression[feature] = X_regression[feature].astype('category')
# one hot encoding
X_classification = pd.get_dummies(X_classification, sparse=True)
X_regression = pd.get_dummies(X_regression, sparse=True)
print(X_classification.shape)
print(X_regression.shape)
# Let's standardize the continuous attributes and see
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_classification)
X_classification_std = sc.transform(X_classification)
sc.fit(X_regression)
X_regression_std = sc.transform(X_regression)
# Write the modified dataframe to a CSV file so that we can use it elsewhere
#temp = pd.get_dummies(df, sparse=True)
#temp.to_csv('../../data/UKAccidents_Cleaned.csv', index=False)
Points: 5
Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created).
The dataset for the classification task contains 190 attributes including the one hot encoded attributes. The response is Accident Severity which is stored in y_classification and has three classes, 'Slight', 'Serious' and 'Fatal'.
print(X_classification.shape)
print(X_classification.info())
print(y_classification.value_counts())
X_classification.head()
The dataset for the regression task also contains 190 attributes including the one hot encoded attributes. The response is Engine Capacity which is stored in y_regression.
In the engine capacity data, we found an outlier with a value of 91000. We considered it an anomaly and removed the data point before building our models.
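When removing a row like this, the target Series and the standardized feature array must stay aligned. One way to do that safely is a single boolean mask applied to both; a toy sketch (made-up values, not the accident data):

```python
import numpy as np
import pandas as pd

# y carries non-positional index labels (as left over after dropna),
# while the standardized features live in a plain numpy array.
y = pd.Series([1200.0, 1600.0, 91000.0, 2000.0], index=[10, 11, 12, 13])
X_std = np.arange(8.0).reshape(4, 2)

# The same boolean mask filters both objects positionally, so target and
# features stay aligned without mixing label and positional indexing.
keep = (y < 91000).to_numpy()
y_clean = y[keep]
X_clean = X_std[keep]
print(y_clean.shape, X_clean.shape)
```

This avoids the pitfall of passing a pandas index *label* to `np.delete`, which interprets it as a *position*.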
import plotly.offline as py
import plotly.graph_objs as go
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
py.init_notebook_mode(connected=True)
b1 = go.Box(
y=y_regression,
name="Engine Capacity - Regression Task"
)
data = [b1]
py.iplot(data)
# Locate the outlier by its index label, then convert the label to a
# positional index so the matching row can be removed from the numpy array
label = y_regression[y_regression >= 91000].index[0]
position = y_regression.index.get_loc(label)
print(label, position)
y_regression = y_regression.drop(index=label)
X_regression_std = np.delete(X_regression_std, [position], axis=0)
print(X_regression_std.shape)
b1 = go.Box(
y=y_regression,
name="Engine Capacity - Regression Task"
)
data = [b1]
py.iplot(data)
print(X_regression.shape)
print(X_regression.info())
X_regression.head()
Points: 10
Choose and explain your evaluation metrics that you will use (i.e., accuracy, precision, recall, F-measure, or any metric we have discussed). Why are the measure(s) appropriate for analyzing the results of your modeling? Give a detailed explanation backing up any assertions.
Accuracy is one of the metrics used to evaluate a classification model: it is the proportion of all predictions that were correct. We will use sklearn's accuracy_score function to calculate the accuracy of the models.
Precision or Positive Predictive Value is the proportion of positive cases that were correctly identified. Recall or Sensitivity is the proportion of actual positive cases which are correctly identified.
F1 score or F-Measure is the harmonic average of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0.
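As a small illustration of these metrics (on made-up labels for the three severity classes, not the accident data), they can be computed with scikit-learn:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Six toy predictions, small enough to check by hand.
y_true = ['Slight', 'Slight', 'Serious', 'Fatal', 'Slight', 'Serious']
y_pred = ['Slight', 'Serious', 'Serious', 'Fatal', 'Slight', 'Slight']

acc = accuracy_score(y_true, y_pred)  # 4 of 6 predictions correct
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average='weighted')
print(acc, precision, recall, f1)
```

With `average='weighted'`, per-class scores are averaged weighted by each class's support, which matters for an imbalanced target like accident severity.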
#
# Code Reference: https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-download-auto-examples-model-selection-plot-confusion-matrix-py
#
def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'
    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Only use the labels that appear in the data
    #classes = classes[unique_labels(y_true, y_pred)]
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')
    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")
    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax
np.set_printoptions(precision=2)
#
# Code from Dr. Larson's notebook 06. Classification
#
def per_class_accuracy(ytrue, yhat):
    # Diagonal of the row-normalized confusion matrix = per-class recall
    conf = mt.confusion_matrix(ytrue, yhat)
    norm_conf = conf.astype('float') / conf.sum(axis=1)[:, np.newaxis]
    return np.diag(norm_conf)

def plot_class_acc(ytrue, yhat, title=''):
    acc_list = per_class_accuracy(ytrue, yhat)
    print(acc_list)
    plt.bar(range(len(acc_list)), acc_list)
    plt.xlabel('Class value')
    plt.ylabel('Accuracy within class')
    plt.title(title + ", Total Acc=%.1f" % (100 * mt.accuracy_score(ytrue, yhat)))
    plt.grid()
    plt.ylim([0, 1])
    plt.show()
We will use the following metrics to evaluate the regression models: Explained Variance Score, R-Square, Mean Absolute Error, and Mean Squared Error.
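These regression metrics can be illustrated on toy values small enough to verify by hand (made-up numbers, not the accident data):

```python
import numpy as np
from sklearn import metrics as mt

# Three toy engine-capacity values and their predictions.
y_true = np.array([1000.0, 1500.0, 2000.0])
y_pred = np.array([1100.0, 1400.0, 2000.0])

evs = mt.explained_variance_score(y_true, y_pred)
r2  = mt.r2_score(y_true, y_pred)              # 1 - SS_res/SS_tot = 1 - 20000/500000
mae = mt.mean_absolute_error(y_true, y_pred)   # (100 + 100 + 0) / 3
mse = mt.mean_squared_error(y_true, y_pred)    # (100**2 + 100**2 + 0) / 3
print(evs, r2, mae, mse)
```

Here R-Square and Explained Variance coincide because the residuals have zero mean; in general Explained Variance ignores any systematic bias in the predictions while R-Square penalizes it.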
Points: 10
Choose the method you will use for dividing your data into training and testing splits (i.e., are you using Stratified 10-fold cross validation? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate. For example, if you are using time series data then you should be using continuous training and testing sets across time.
Since we have enough observations, we performed an 80/20 split for the training and test data. Because the first task is a classification problem with 3 classes, we wanted the class frequency distribution to be the same across the original, training, and test datasets. We used the stratify option of the train_test_split function to achieve that, and we verified the class frequency distributions across the datasets using bar charts.
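The effect of `stratify` can be checked numerically on a synthetic imbalanced target (made-up data, roughly mirroring the skew of accident severity):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels and random features.
rng = np.random.RandomState(1)
y = np.array(['Slight'] * 850 + ['Serious'] * 130 + ['Fatal'] * 20)
X = rng.rand(len(y), 3)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

# With stratify=y, each class keeps (nearly) the same share in both splits.
for cls in ['Slight', 'Serious', 'Fatal']:
    print(cls, round(np.mean(y_tr == cls), 3), round(np.mean(y_te == cls), 3))
```

Without `stratify`, a rare class like 'Fatal' could easily be over- or under-represented in a single random split.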
# Split training and testing data 80/20
from sklearn.model_selection import train_test_split
X_classification_train, X_classification_test, y_classification_train, y_classification_test = train_test_split(X_classification_std, y_classification, test_size=0.2, random_state=1, stratify=y_classification)
counts_original = y_classification.value_counts().to_frame()
counts_train = y_classification_train.value_counts().to_frame()
counts_test = y_classification_test.value_counts().to_frame()
trace1 = go.Bar(
x=counts_original.index,
y=counts_original['Accident Severity'],
name='Original Data'
)
trace2 = go.Bar(
x=counts_train.index,
y=counts_train['Accident Severity'],
name='Training Data'
)
trace3 = go.Bar(
x=counts_test.index,
y=counts_test['Accident Severity'],
name='Test Data'
)
fig = tools.make_subplots(rows=1, cols=3)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.append_trace(trace3, 1, 3)
fig['layout'].update(height=600, width=800, title='Frequency distribution of classes between datasets',
xaxis=dict(title='Accident Severity'),
yaxis=dict(title='Observations Count'))
py.iplot(fig)
Since we have enough observations, we performed an 80/20 train/test split for the second (regression) task as well.
X_regression_train, X_regression_test, y_regression_train, y_regression_test = train_test_split(X_regression_std, y_regression, test_size=0.2, random_state=1)
Points: 20
Create three different classification/regression models for each task (e.g., random forest, KNN, and SVM for task one and the same or different algorithms for task two). Two modeling techniques must be new (but the third could be SVM or logistic regression). Adjust parameters as appropriate to increase generalization performance using your chosen metric. You must investigate different parameters of the algorithms!
from sklearn.ensemble import RandomForestClassifier
def random_forest_classifier(features, target):
    """
    Train the random forest classifier with features and target data
    :param features:
    :param target:
    :return: trained random forest classifier
    """
    clf = RandomForestClassifier()
    clf.fit(features, target)
    return clf
# Create random forest classifier instance
%time task1_model1_rf = random_forest_classifier(X_classification_train, y_classification_train)
print ("Trained model :: ", task1_model1_rf)
%time predictions_task1_model1_rf = task1_model1_rf.predict(X_classification_test)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
%time task1_model2_knn = KNeighborsClassifier(n_neighbors=3)
%time task1_model2_knn.fit(X_classification_train, y_classification_train)
print ("Trained model :: ", task1_model2_knn)
%time predictions_task1_model2_knn = task1_model2_knn.predict(X_classification_test)
from sklearn.ensemble import GradientBoostingClassifier
task1_model3_gb = GradientBoostingClassifier(n_estimators=50)
%time task1_model3_gb.fit(X_classification_train, y_classification_train)
print ("Trained model :: ", task1_model3_gb)
%time predictions_task1_model3_gb = task1_model3_gb.predict(X_classification_test)
from sklearn.ensemble import RandomForestRegressor
task2_model1_rf = RandomForestRegressor()
%time task2_model1_rf.fit(X_regression_train, y_regression_train)
print ("Trained model :: ", task2_model1_rf)
%time predictions_task2_model1_rf = task2_model1_rf.predict(X_regression_test)
print(predictions_task2_model1_rf)
from sklearn.linear_model import BayesianRidge
task2_model2_br = BayesianRidge()
%time task2_model2_br.fit(X_regression_train, y_regression_train)
print ("Trained model :: ", task2_model2_br)
%time predictions_task2_model2_br = task2_model2_br.predict(X_regression_test)
print(predictions_task2_model2_br)
import xgboost as xgb
task2_model3_xgb = xgb.XGBRegressor()
%time task2_model3_xgb.fit(X_regression_train, y_regression_train)
print ("Trained model :: ", task2_model3_xgb)
%time predictions_task2_model3_xgb = task2_model3_xgb.predict(X_regression_test)
print(predictions_task2_model3_xgb)
Points: 10
Analyze the results using your chosen method of evaluation. Use visualizations of the results to bolster the analysis. Explain any visuals and analyze why they are interesting to someone that might use this model.
# Plot non-normalized confusion matrix
plot_confusion_matrix(y_classification_test, predictions_task1_model1_rf, classes=['Fatal', 'Serious', 'Slight'],
title='Confusion matrix, without normalization')
# Plot non-normalized confusion matrix
plot_confusion_matrix(y_classification_test, predictions_task1_model2_knn, classes=['Fatal', 'Serious', 'Slight'],
title='Confusion matrix, without normalization')
# Plot non-normalized confusion matrix
plot_confusion_matrix(y_classification_test, predictions_task1_model3_gb, classes=['Fatal', 'Serious', 'Slight'],
title='Confusion matrix, without normalization')
classes = ['Fatal', 'Serious', 'Slight']
rf_accuracy = per_class_accuracy(y_classification_test, predictions_task1_model1_rf)
knn_accuracy = per_class_accuracy(y_classification_test, predictions_task1_model2_knn)
gb_accuracy = per_class_accuracy(y_classification_test, predictions_task1_model3_gb)
trace1 = go.Bar(
x=classes,
y=rf_accuracy,
name='Random Forest'
)
trace2 = go.Bar(
x=classes,
y=knn_accuracy,
name='KNN'
)
trace3 = go.Bar(
x=classes,
y=gb_accuracy,
name='Gradient Boost'
)
fig = tools.make_subplots(rows=1, cols=3)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.append_trace(trace3, 1, 3)
fig['layout'].update(height=600, width=800, title='Comparison of Accuracy between classification models',
xaxis=dict(title='Models'),
yaxis=dict(title='Accuracy'))
py.iplot(fig)
Comparing the accuracy scores of the models: based on accuracy alone, KNN appears to be the best-performing model for the provided data.
from sklearn.metrics import precision_recall_fscore_support
rf_scores = precision_recall_fscore_support(y_classification_test, predictions_task1_model1_rf, average='weighted')
print(rf_scores)
print(rf_scores[0:3])
print(type(rf_scores))
from sklearn.metrics import precision_recall_fscore_support
scores = ['Precision', 'Recall', 'F1 Score']
rf_scores = precision_recall_fscore_support(y_classification_test, predictions_task1_model1_rf, average='weighted')
knn_scores = precision_recall_fscore_support(y_classification_test, predictions_task1_model2_knn, average='weighted')
gb_scores = precision_recall_fscore_support(y_classification_test, predictions_task1_model3_gb, average='weighted')
trace1 = go.Bar(
x=scores,
y=rf_scores[0:3],
name='Random Forest'
)
trace2 = go.Bar(
x=scores,
y=knn_scores[0:3],
name='KNN'
)
trace3 = go.Bar(
x=scores,
y=gb_scores[0:3],
name='Gradient Boost'
)
fig = tools.make_subplots(rows=1, cols=3)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.append_trace(trace3, 1, 3)
fig['layout'].update(height=600, width=800, title='Comparison of Precision, Recall & F1 Score between classification models',
xaxis=dict(title='Models'),
yaxis=dict(title='Metrics'))
py.iplot(fig)
Precision, Recall and F1 Scores of all the classification models are comparable to each other.
Points: 10
Discuss the advantages of each model for each classification task, if any. If there are not advantages, explain why. Is any model better than another? Is the difference significant with 95% confidence? Use proper statistical comparison methods. You must use statistical comparison techniques—be sure they are appropriate for your chosen method of validation as discussed in unit 7 of the course.
The goal of the comparison is to evaluate all three models with a consistent test harness. Since we had enough observations, we used K-Fold cross-validation with 10 splits to estimate each model's performance on new data. Since we are comparing regression models, we used the R-Square metric to compare them.
# Models array
# Code Reference: https://machinelearningmastery.com/compare-machine-learning-algorithms-python-scikit-learn/
from sklearn import model_selection
models = []
models.append(('Random Forest', task2_model1_rf))
models.append(('Bayesian', task2_model2_br))
models.append(('XGBoost', task2_model3_xgb))
# prepare configuration for cross validation test harness
seed = 7
# evaluate each model in turn
results = []
names = []
scoring = 'r2'
# KFold needs shuffle=True for random_state to take effect; create it once
# and reuse the same folds for every model so the comparison is paired
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
for name, model in models:
    %time cv_results = model_selection.cross_val_score(model, X_regression_train, y_regression_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
import plotly.offline as py
import plotly.graph_objs as go
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
py.init_notebook_mode(connected=True)
b1 = go.Box(
y=results[0],
name="Random Forest"
)
b2 = go.Box(
y=results[1],
name="Bayesian"
)
b3 = go.Box(
y=results[2],
name="XGBoost"
)
data = [b1, b2, b3]
py.iplot(data)
Based on the cross-validation scores, the XGBoost regressor performed better than the Random Forest and Bayesian models. A likely reason is that XGBoost builds its trees sequentially, with each tree correcting the residual errors of the previous ones (boosting), whereas Random Forest averages independently grown trees (bagging).
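To test whether such a difference is significant at the 95% confidence level, one common approach is a paired t-test on the per-fold cross-validation scores, since both models are scored on the same folds. A sketch with hypothetical fold scores (in the notebook, the arrays stored in `results` would be used instead):

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold R^2 scores for two models evaluated on the SAME
# ten folds (illustrative numbers, not actual notebook output).
xgb_scores = np.array([0.61, 0.63, 0.60, 0.64, 0.62, 0.59, 0.63, 0.61, 0.62, 0.60])
rf_scores  = np.array([0.55, 0.58, 0.54, 0.59, 0.56, 0.53, 0.57, 0.55, 0.56, 0.54])

# Paired t-test: the folds are matched pairs, so we test the per-fold
# differences rather than treating the two samples as independent.
t_stat, p_value = stats.ttest_rel(xgb_scores, rf_scores)
print('t = %.3f, p = %.5f' % (t_stat, p_value))
if p_value < 0.05:
    print('The difference is significant at the 95% confidence level')
```

Note that reusing the same data across folds violates the strict independence assumption of the t-test, so the result should be read as indicative rather than exact.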
The following table compares the other metrics: Explained Variance Score, Mean Absolute Error, and Mean Squared Error. XGBoost edges out the other models on these metrics as well.
metrics = ['Explained Variance Score', 'R-Square', 'Mean Absolute Error', 'Mean Squared Error']
models = [' ', 'Random Forest Regressor', 'Bayesian Ridge Regressor', 'XGBoost Regressor']
rf_metrics = [mt.explained_variance_score(y_regression_test, predictions_task2_model1_rf),
mt.r2_score(y_regression_test, predictions_task2_model1_rf),
mt.mean_absolute_error(y_regression_test, predictions_task2_model1_rf),
mt.mean_squared_error(y_regression_test, predictions_task2_model1_rf)]
br_metrics = [mt.explained_variance_score(y_regression_test, predictions_task2_model2_br),
mt.r2_score(y_regression_test, predictions_task2_model2_br),
mt.mean_absolute_error(y_regression_test, predictions_task2_model2_br),
mt.mean_squared_error(y_regression_test, predictions_task2_model2_br)]
xgb_metrics = [mt.explained_variance_score(y_regression_test, predictions_task2_model3_xgb),
mt.r2_score(y_regression_test, predictions_task2_model3_xgb),
mt.mean_absolute_error(y_regression_test, predictions_task2_model3_xgb),
mt.mean_squared_error(y_regression_test, predictions_task2_model3_xgb)]
trace = go.Table(
header=dict(values=list(models),
fill = dict(color='#C2D4FF'),
align = ['left'] * 5),
cells=dict(values=[metrics, rf_metrics, br_metrics, xgb_metrics],
fill = dict(color='#F5F8FF'),
align = ['left'] * 5))
data = [trace]
py.iplot(data)
Points: 10
Which attributes from your analysis are most important? Use proper methods discussed in class to evaluate the importance of different attributes. Discuss the results and hypothesize about why certain attributes are more important than others for a given classification task.
df3 = pd.DataFrame(task1_model1_rf.feature_importances_, X_classification.columns)
df4 = df3.sort_values([0], ascending=False)
df5 = df4.head(15)
df5
Based on the feature importances, the Random Forest classifier ranks Latitude, Longitude, the day of the month, and the age of the vehicle among its top features.
df3 = pd.DataFrame(task2_model3_xgb.feature_importances_, X_regression.columns)
df4 = df3.sort_values([0], ascending=False)
df5 = df4.head(15)
df5
Based on the feature importances, the XGBoost regressor ranks Latitude, Longitude, and Vehicle Type among its top features.
Points: 5
How useful is your model for interested parties (i.e., the companies or organizations that might want to use it for prediction)? How would you measure the model's value if it was used by these parties? How would you deploy your model for interested parties? What other data should be collected? How often would the model need to be updated, etc.?
The classification models that predict the severity of an accident perform well, with good accuracy. The models can be used in a couple of scenarios:
Accident response teams can use the models to predict the severity of accidents based on data recorded by IoT devices, witnesses, and first responders. This gives hospitals and other services, such as fire departments, valuable advance notice so they are ready to deal with casualties.
Modified models can provide valuable analysis of the past accidents and can help the city planning teams to take accident prevention steps including installing new traffic signals in accident prone areas and provide warning signs to drivers.
The regression models predict the engine capacity of the vehicles involved in the accidents. Even though the practical use of the predictive capabilities are limited, the exercise provided us a good opportunity to prepare the data for regression, develop predictive regression models and also evaluate the performance of the regression models.
To measure the value of the models the user needs to consider the criteria detailed above as well as the cost implications involved in implementing changes based on the outcomes. Specifically, regarding the classification models, these scenarios involve a great deal of resources beyond financial, and these resource implications also need to be evaluated.
To further improve the models, more data on traffic accidents in these areas can be collected over several years. The more data that is gathered, the greater the effectiveness of the models, and the better outcomes/decisions can be made as to the best path of accident mitigation.
Points: 10
You have free rein to provide additional analyses. One idea: grid search parameters in a parallelized fashion and visualize the performances across attributes. Which parameters are most significant for making a good model for each classification algorithm?
#
# Code Reference: https://plot.ly/scikit-learn/plot-bayesian-ridge/
#
import plotly.offline as py
import plotly.graph_objs as go
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
py.init_notebook_mode(connected=True)
lw = 2
w = np.zeros(X_regression_train.shape[1])
p1 = go.Scatter(y=[float(i)/sum(task2_model2_br.coef_) for i in task2_model2_br.coef_],
mode='lines',
line=dict(color='lightgreen', width=lw),
name="Bayesian Ridge")
p2 = go.Scatter(y=w,
mode='lines',
line=dict(color='gold', width=lw),
name="Ground truth")
p3 = go.Scatter(y=[float(i)/sum(task2_model1_rf.feature_importances_) for i in task2_model1_rf.feature_importances_],
mode='lines',
line=dict(color='navy'),
name="Random Forest")
p4 = go.Scatter(y=[float(i)/sum(task2_model3_xgb.feature_importances_) for i in task2_model3_xgb.feature_importances_],
mode='lines',
line=dict(color='red'),
name="XGBoost")
layout = go.Layout(title="Weights of the model",
xaxis=dict(title="Features"),
yaxis=dict(title="Values of the weights")
)
fig = go.Figure(data=[p1, p2, p3, p4], layout=layout)
py.iplot(fig)
print(X_regression.columns[170], ', ', X_regression.columns[55], ', ', X_regression.columns[56], ', ', X_regression.columns[57], ', ', X_regression.columns[61])
Based on the weights, the following are the attributes that play a significant role in all the regression models:
These attributes are the key characteristics of the vehicle and the models identified them as key attributes in predicting the engine capacity.